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Abstract. pymorphy2 is a morphological analyzer and generator for 
Russian and Ukrainian languages. It uses large efficiently encoded lexi¬ 
cons built from OpenCorpora and LanguageTool data. A set of linguisti¬ 
cally motivated rules is developed to enable morphological analysis and 
generation of out-of-vocabulary words observed in real-world documents. 
For Russian pymorphy2 provides state-of-the-arts morphological analysis 
quality. The analyzer is implemented in Python programming language 
with optional C+-|- extensions. Emphasis is put on ease of use, documen¬ 
tation and extensibility. The package is distributed under a permissive 
open-source license, encouraging its use in both academic and commer¬ 
cial setting. 
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1 Introduction 

Morphological analysis is an analysis of internal structure of words. For languages 
with rich morphology like Russian or Ukrainian using the morphological analysis 
it is possible to figure out if a word can be a noun or a verb, or if it can be 
singular or plural. Morphological analysis is in an important step of natural 
language processing pipelines for such languages. 

Morphological generation is a process of building a word given its gram¬ 
matical representation; this includes lemmatization, inflection and finding word 
lexemes. 

pymorphy2 is a morphological analyzer and generator for Russian and Ukrainian 
language widely used in industry and in academia. It is being developed since 
2012; Ukrainian support is a recent addition. The development of its predecessor, 
pymorphj0 started in 2009. The package is availabl^l under a permissive license 
(MIT), and it uses open source permissively licensed dictionary data. 

The rest of this paper is organized as follows. In Section 2 pymorphy2 soft¬ 
ware architecture and design principles are described. Section 3 explains how 

^ https://bitbucket.com/kmike/pymorphy 
^ https://github.com/kmike/pymorphy2 



pymorphy2 uses lexicons and how analysis and morphological generation work 
for vocabulary words. In Section 4 methods used for out-of-vocabulary words are 
explained and compared with approaches used by other morphological analyzers. 
Section 5 is dedicated to a problem of selecting correct analysis from all possible 
analyses, and a role of morphological analyzer in this task. In Section 6 evalu¬ 
ation results are presented. Section 7 outlines a roadmap for future pymorphy2 
improvements. 


2 Software Architecture 


pymorphy2 is implemented as a cross-platform Pythoij^ library, with a command¬ 
line utility and optional C-|—|- extensions for faster analysis. Both Python 2.x and 
Python 3.x are supported. An extensive testing suite (600+ unit tests) ensures 
the code quality; test coverage is kept above 90%. There is online documentatior0 
available. 

When optional C-|—h extension is used (or when pymorphy2 is executed using 
PyPjH Python interpreter) the parsing speed is usually in tens of thousands of 
words per second; in some specific cases in can exceed 100000 words per second 
in a single thread. Without the extension parsing speed is in thousands of words 
per second. The memory consumption is about 15MB, or about 30MB if we 
account for Python interpreter itself. 

Users are provided with a simple API for working with words, their analyses 
and grammatical tags. There are methods to analyze words, inflect and lemma- 
tize them, build word lexemes, make words agree with a number, methods for 
working with tags, grammemes and dictionaries. Inherent complexity of working 
with natural languages is not hidden from the user. For example, to lemma- 
tize the word correctly it is necessary to choose the correct analysis from a list 
of possible analyses; pymorphy2 provides P[analysis\word) estimates and sorts 
the results accordingly, but requires user to choose the analysis explicitly before 
normalizing the word. 

Analysis of vocabulary words and out-of-vocabulary words is unified. There 
is a configurable pipeline of "analyzer units"; it contains a unit for vocabulary 
words analysis and units (rules) for out-of-vocabulary words handling. Individ¬ 
ual units can be customized or turned off; some rules are parametrized with 
language-specific data. Users can create their own analyzer units (rules). This 
all makes it possible to perform morphological analysis experiments without 
changing pymorphy2 source code, develop domain-specific morphology analysis 
pipelines and adapt pymorphy2 to work with languages other than Russian. The 
latter point is validated by introducing an experimental support for Ukrainian 
language. 


^ https://www.python.org/ 
http://pymorphy2.readthedocs.org 
® http://pypy.org/ 



3 Analysis of Vocabulary Words 


pymorphy2 relies on large lexicons for analysis of common words. For Russian 
it uses OpenCorpora [3] dictionary ^ 5 * 10® word forms, ^ 0.39 * 10® lemmas) 
converted from OpenCorpora XMlJj format to a compact representation opti¬ 
mized for morphological analysis and generation tasks. End users don’t have to 
compile the dictionaries themselves; pymorphy2 ships with prebuilt periodically 
updated dictionaries. 

Any dictionary in OpenCorpora XML format can be used by pymorphy2. 
For Ukrainian there is such experimental dictionary 2.5 * 10® word forms) 
being developecfl by Andriy Rysin, Dmitry Chaplinsky, Mariana Romanyshyn 
and other contributors; it is based on LanguageTooH data. 

Source dictionary contains word forms with their tags, grouped by lexemes. 
For example, a lexeme for lemma "ex" (a hedgehog) looks like this: 


eac 

NOUN,emim,masc 

sing,nomn 

eaca 

NOUN,emim,masc 

sing,gent 

eacy 

NOUN,emim,masc 

sing,datv 

eacaMH 

NOUN,cUiim,masc 

plur,ablt 

eacax 

NQUN,cUiim,masc 

plur,loct 


In source dictionaries there could also be links between lexemes. For example, 
lexemes for infinitive, verb, gerund and participle forms of the same lemma may 
be connected. Currently pymorphy2 joins connected lexemes into a single lexeme 
for most link types. 


3.1 Morphological Analysis and Generation 

Given a dictionary, to analyze a word means to find all possible grammatical 
tags for a word. Obtaining of a normal form (lemmatization) is finding the first 
word form in the lexeme. To inflect a word is to find another word form in the 
same lexeme with the requested grammemes. 

As can be seen, all these tasks are simple. With an XML dictionary analysis 
of known words can be performed just by running queries on XML file. 

The problem is that querying XML is 0(N) with large constant factors, raw 
data takes quite a lot of memory, and the source dictionary is not well suited for 
morphological analysis and generation of out-of-vocabulary words. 

To create a compact representation and enable fast access pymorphy2 encodes 
lexeme information: all words are stored in a DAFSA [5] using the dawgdic@ C++ 
library m via Python wrappei0; information about word tags and lexemes is 

® http: //opencorpora.org/?page=export 
Conversion utilities: https://github.eom/dchaplinsky/LT20penCorpora 
® https: //languagetool.org/ 

^ https://code.google.com/p/dawgdic/ 
https://github.com/kmike/DAWG 




encoded as numbers. Storage scheme is close to the scheme described in aot.ru 
m, but it is not quite the same. 

Paradigms Paradigm in pymorphy2 is an inflection pattern of a lexeme. It 
consists of prefixi,suffixi,tagi triples, one for each word form in a lexeme, 
such as that each word form i can be represented as prefixi + stem + suffixt 
where stem is the same for all words in a lexeme. 

This representation allows us to factorize a lexeme into a stem and a paradigm. 
Paradigm prefixes, suffixes and tags are encoded as numbers by pymorphy2; 
lexeme stems are discarded. It means that a paradigm is stored as an array of 
numbers (prefixes, suffixes and tags IDs), and lexemes are not stored explicitly 
- they are reconstructed on demand from word and paradigm information. 

There are no paradigms provided in the source dictionary; pymorphy2 infers 
them from the lexemes. For Russian there are about 3200 paradigms inferred 
from 390000 lexemes. 

Word Storage Word forms with their analysis information are stored in a 
DAFSA. Other storage schemes were tried, including two tries scheme similar to 
described in [S] (but using double-array tries), and succinct (MARISAF^ tries. 
For pymorphy2 data DAFSA provided the most compact representation, and at 
the same time it was the fastest and had the most flexible iteration support. 



Fig. 1. DAFSA encoding example. Encoded (word, paradigmid, formindex) triples: 
(flBop, 103, 0); (eac, 104, 0); (flBopnnK, 101, 2); (flBopnnK, 102, 2); (eacHK, 101, 2); 
(eacHK, 102, 2) 


For each word form pymorphy2 stores (word, paradigmid, formindex) triples: 

— word form, as text; 

— ID of its paradigm; 

— word form index in the lexeme. 

DAFSA doesn’t support attaching values to leaves; the information is en¬ 
coded like the following: < word > SEP < paradigmid >< formindex > (see 
an example on fig. flTl^T 

https://code.google.com/p/marisa-trie/ 

pymorphy2 encodes words to UTF-8 before putting them to DAFSA, so in practice 
there are more nodes than shown on fig. [T] It is an implementation detail. 











The storage is especially efficient because words with similar endings often 
have the same analyses, i.e. the same {paradigmid, formindex) pairs; this al¬ 
lows DATS A to use fewer nodes/transitions to represent the data. DATS A for 
Russian OpenCorpora dictionary (5 * 10® analyses, about 3 * 10® unique word 
forms) enables fast lookups (hundreds thousand lookups/sec from Python) and 
takes less than 7MB of RAM; source XML file is about 400MB on disk. 

To get all analyses of a word, DAFSA transitions for word are followed, then 
a separator SEP is followed, and then the remaining subtree is traversed to get 
all possible {paradigmid, formindex) pairs. 

Given {paradigmid, formindex) pair one can find the grammatical tag of a 
word: find a paradigm in paradigms array by paradigmid, get {prefixi, suffixi, tagi) 
triple from a paradigm by using i := formindex. Given {paradigmid, formindex) 
pair and the word itself it is possible to restore the lexeme and lemmatize or 
inflect the word - from word, prefiXi and suffiXi we can get the stem, and 
given a stem and {prefixk, suffixk, tagu) it is possible to restore a full word for 
k-th word form. 


3.2 Working with "e" and "t" Characters Efficiently 

The usage of "e" letter is optional in Russian; in real texts it is often replaced 
with "e" letter. There rules for "t" / "r" substitutions are different in Ukrainian, 
but in practice there are real-world texts with "t" letters replaced with "r". 

The simplest way to handle it is to replace "e" / "t" with "e" / "r" both 
in the input text and in the dictionary. However, this is suboptimal because 
it discards useful information, makes the text less correct (in Ukrainian "r" 
instead of "t" can be seen as a spelling error) and increases the ambiguity: there 
are words which analysis should depend on e/e and t/r. For example, the word 
"nee" should be parsed as plural, but the word "ece" shouldn’t. 

pymorphy2 assumes that "e" / "t" usage in dictionary is mandatory, but in 
the input text it is optional. For example, if a Russian input word contains "e" 
letter then only analyses with this letter are returned; if there are "e" letters in 
the input word then possible analyses both for "e" and "e" are returned. 

An easy way to implement this would be to check each combination of e/e 
and t/r replacement for the input word. It is not how pymorphy2 works. To 
do the task efficiently, pymorphy2 exploits DAFSA [S] dictionary structure: the 
result is built by traversing the word character graph and trying to follow "e" 
transitions in addition to "e" transitions (for Russian) and "t" transitions in 
addition to "r" transitions (for Ukrainian). 


4 Analysis of Out-of-Vocabulary Words 

It is not practical to try incorporate all the words in a lexicon - there is a long 
tail of rarely used words, new words appear; there is morphological derivation, 
loanwords, it is challenging to add all names, locations and special terms to the 
dictionary. Empirically, Zipf’s Law seems to hold for natural languages [14]; one 


of the consequences is that even doubling the size of a lexicon could increase the 
coverage only slightly [B]. 

For languages without rich morphology it may be practical to assume that 
if word is not in a dictionary then it can be of any class from the open word 
classes, and then disambiguate the results on later processing stages, using e.g. a 
contextual POS tagger or a syntactic parser. For Slavic languages doing this on 
later stages is challenging because of large tagsets - for example, OpenCorpora 
131 words have more than 4500 different tags. Morphological analyzer solves it 
by limiting the number of possible analyses based on word shape. 

pymorphy2 uses a set of rules (analyzer units) to handle unknown words. 
Some of the rules are described in literature |8| 9 | 10|6 |4]; the resulting combina¬ 
tion is novel. The order of in which the rules are applied is language-specific. 

4.1 Common Prefixes Removal 

There is a set of immutable prefixes which can be attached to words of open 
classes (nouns, verbs, adjectives, adverbs, participles, gerunds) without affecting 
the word grammatical properties. Examples of such prefixes for Russian: "ne", 
"nceB;i,o", "aBua"; pymorphy2 provides language-specific lists of these prefixes. 

When a words starts with one of these prefixes, pymorphy2 removes the 
prefix, parses the reminder and re-attaches the prefix. A similar rule is described 
in [S]. Note that full analysis is performed on the reminder, so the reminder can 
be an out-of-vocabulary word itself. To speedup prefix matching built-in lists of 
prefixes are encoded to DAFSAs. 

4.2 Words Ending with Other Dictionary Words 

When all the following apply pymorphy2 assumes the whole word can be parsed 
the same way as the "sufhx" word: 

— a word being analyzed has another word from a dictionary as a suffix; 

— the length of this "suffix" word is greater than 3; 

— the length of the word without the "suffix" is no greater than 5; 

— "suffix" word is of an open class (noun, verb, adjective, participle, gerund) 

To search for suffixes pymorphy2 tries to consider 1st letter as a prefix, then 
two first letters as a prefix, etc., and lookups the reminder in a dictionary. 

This rule is the same as described in [ini. A similar rule is described in [B], 
though its induction for concrete prefixes is different. 

4.3 Endings Matching 

In many languages, including Russian and Ukrainian, words with common end¬ 
ings often have the same grammatical form. 

To exploit this, pymorphy2 first collects the information from the dictionary: 
for each word all endings of length 1 to 5 are extracted, and all possible analyses 
for these endings are stored. Then this ending {analyses} mapping is cleaned 
up: 




— only the most frequent analyses for each POS tag are kept; 

— analyses from non-productive paradigms (currently these are paradigms which 

produced less than 3 lexemes in a dictionary) are discarded; 

— rare endings (currently the ones which occur once) are also discarded. 

The resulting mapping is encoded to DAPS A for fast lookups. Storage scheme 
is the following: < ending > SEP < analysisinfo >, where analysisinfo con¬ 
sists of three 2-byte numbers: [frequency,paradigmld, formindex) - analysis 
frequency (a number of times a word with this ending had this analysis), ID of 
analysis paradigm and the form index inside the paradigm. 

At prediction time pymorphy2 checks word endings from length 5 to 1, stop¬ 
ping at the first ending with some analyses found. To get possible analyses for a 
given ending pymorphy2 first follows all DAFSA transitions for the ending, then 
follows a separator, and then traverses the remaining subtree to get possible 
analysisinfo triples. The result is then sorted by analyses frequencies. 

Recall that a word and a {paradigmid, formindex) pair is all what is 
needed to restore the lexeme and inflect the word. Lexemes are created on 
fly, so it doesn’t matter word is not from the vocabulary as soon as we have 
(paradigmid, formindex) pair. It means morphological generation (lemmatiza- 
tion, inflection) works here. 

Only analyses with open-class parts of speech (noun, verb, adjective, partici¬ 
ple, gerund) are produced. Special care is taken to handle "e" letter properly. 
Also, special care is required to handle paradigm prefixes properly - in fact, there 
are several ending —>■ {analyses} DAFSAs built, one per each paradigm prefix. 

This rule is based on [10]; similar approaches are also used in |4] and 0. m 
uses similar rules, but derives them differently. 

4.4 Words with a Hyphen 

Unlike some other morphological analyzers, pymorphy2 opts to handle words 
with a hyphen. 

In |7| it is argued that in most cases the parts of compound words should be 
handled as separate words if they are joined using a hyphen. A similar decision 
is made in OpenCorpora tokenization module |2]; it considers words like "kKan- 
Ilojib" as three tokens which should be analyzed separately and joined back 
at later processing stages. In both cases the decisions are not motivated by 
linguistic considerations; it is the technical difficulty which prevents analyzing 
and processing such words as single entities. 

Currently pymorphy2 handles adverbs with a hyphen, particles separated by 
a hyphen and compound words with left and right parts separated by a hyphen. 

Adverbs with a Hyphen Russian words are parsed as adverbs if they 

— start with a "no-" prefix; 

— have total length greater than 5; 

— can be parsed as a full singular adjective in dative case when "no-" is removed 

Examples: "no-cenepHOMy", "uo-xopomeMy". 


Particles Separated by a Hyphen Though it is not clear if words with a 
particle separated by a hyphen (e.g. "cMOTpn-Ka" or "nocMOxpeji-TaKH") should 
be handled as a single word or as two words, pymorphy2 supports parsing of 
such words. There are language-specific lists of common particles which can be 
attached, and if a word ends with one of these particles then it is parsed without 
the particle, and then the particle is re-attached to the result. 


Compound Words with a Hyphen The main challenge in analysis of the 
compound words which parts are separated by a hyphen (like "uejiOBeK-nayK" 
and "IJ,apb-nymKa") is to figure out if the left part should be inflected together 
with the right part, or if it is a fixed prefix. 

To do this, pymorphy2 parses left and right parts separately (they don’t have 
to be dictionary words). Then it tries to Hnd matching analyses. If there is a 
"left" analysis compatible with one of the "right" analyses then the resulting 
analysis is built where both word parts are inflected. After that, an analysis 
with a fixed left part is added to the result, regardless of whether a compatible 
"left" analysis was found or not. A similar method was used in [J. 

Only words with a single hyphen are handled using heuristics described 
above. Words with multiple hyphens are likely represent different phenomena 
in Russian and Ukrainian languages; they could be interjections or phrases m- 

4.5 Other Tokens 

Initial is an abbreviation of person’s first or patronymic name. In most cases an 
initial is a single upper-cased character (language-specific). pymorphy2 parses 
such characters as fixed singular nouns, with variants for all possible gender and 
case combinations. For person first names [Name) two different lexemes are 
built for male and female names. For patronymic names [Pair) a single lexeme 
is returned. Unlike all other analyzer rules, detection of initials is case-sensitive. 
It is a way to decrease ambiguity. 

The following tags are assigned to non-lexical tokens: PNCT for punctuation, 
LATN for tokens written in Latin alphabet, NUMB,intg for integer numbers, 
NUMB, real for floating-point numbers, ROMN for Roman numbers. 

When analyzing the text, it is common to classify tokens during the tok- 
enization step. The reason pymorphy2 handles non-lexical tokens during the 
morphological analysis step is that this allows users to use a simpler tokenizer 
when classifying tokenizer is not available; also, it means that information about 
all tokens is available in a common format. 

4.6 Morphological Generation of Ont of Vocabulary Words 

Inflection is fully supported for out of vocabulary words. To achieve this pymor- 
phy2 keeps track of the analyzer units (rules and their parameters) used to parse 
the word, requires each analyzer unit to provide a method for getting a lexeme, 
and calls this method for the last analyzer unit. To compute the lexeme analyzer 


unit can look at the analysis result, and it can ask previous analyzer units for 
the lexeme. 

For example, Common Prefixes Removal analyzer removes the prefix from 
a word, then gets a lexeme from the previous analyzer, and then attaches the 
prefix to each word form in a lexeme to build a resulting lexeme. 

5 Probability Estimation 

Morphological analyzer may return multiple possible word parses. The problem 
of choosing the right analysis from a list of possible options is called disam¬ 
biguation. Generally, to select the correct analysis it is required to take word 
context in account. Morphological analyzer takes individual words as an input, 
so it can’t disambiguate the result robustly. However, it can provide an estima¬ 
tion for P(analysis\word) conditional probability. Such probability estimations 
can be used in absence of a dedicated disambiguator to select the more probable 
analysis. In addition to that, these probabilities can be used on later stages of 
text analysis, for example by a disambiguator. 

To estimate P{analysis\word) conditional probability for Russian words py- 
morphy2 uses partially disambiguated OpenCorpora corpus [3] and assumes that 
P{analysis\word) = P{tag\word). The conditional probability is estimated for 
words which have multiple analysis according to pymorphy2, but have occur¬ 
rences with a single remaining analysis in the OpenCorpora corpus; the estima¬ 
tion is a maximum-likelihood estimation with Laplace (add-one) smoothing. 


Wdisambiguated ■= {word : \tagScorpus{word)\ = l,word e corpus} 

^^ambiguous •— }wOTd . \io,gSpyjjiorphy^i^^Ord'^^ ^ 1, tVOrd ^ ^^disambiguated} 
B{word) = mayi{\tagspymorphy2{word)\, \tagScorpus{word)\) 


\/wOrd e Wambig UOUS 1 


Vta 5 e tagspymorphy2{word) : 


PMLE{tag\word) 


count{word, tag) + 1 
count{word) + B{word) 


( 1 ) 


Counts are computed based on OpenCorpora corpus data; all words with a 
single remaining analysis are taken in account. 

Once estimated, the result is stored on disk as a DAFSA; keys are 


< word >:< tag >< NULL >< wt(10® PMLE{tag\word)) > 


For words without PMLE(tag\word) estimates the probabilities are assigned 
uniformly during the parsing. 

For Ukrainian language probabilities are assigned uniformly because at the 
moment of writing there is no a freely available Ukrainian corpus similar to 
OpenCorpora. 



6 Evaluation 


Evaluating analysis quality of different morphological analyzers for Russian is 
not straightforward because most analyzers (as well as annotated corpora) use 
their own incompatible tagsets. And when a corpus and a dictionary have a 
compatible tagset it usually means that the dictionary was enhanced from the 
corpus, and it is a problem because quality numbers obtained on a corpus the 
dictionary was enhanced from shouldn’t be relied on - they are too optimistic. 

pymorphy2 analysis quality was compared to an analysis quality of a well- 
known morphological analvzen^u. Mystem 3.0 [S]. Testing corpus consists of 100 
randomly selected sentences (1405 tokens) from OpenCorpora (microcorpu|3) 
and 100 randomly selected sentences (1093 tokens) from ruscorpora.ru - 2498 
manually disambiguated tokens in total. 

Full details for this evaluation can be found onlinj^. 

OpenCorpora (pymorphy2) tagset is not the same as ruscorpora.ru tagset, 
and ruscorpora.ru tagset differs from Mystem tagset. For evaluation purposes all 
tags were converted to Mystem format using a set of automatic rules. Quality 
was evaluated on full morphological tags, i.e. tags must match exactly to be 
considered correct, with a few exceptions related to tags conversion problems. 
All reported errors were checked manually to filter out false positives. 


Table 1. Errors 



pymorphy2 

mystem 3.0 

microcorpus 

10 

15 

ruscorpora 

9 

8 

total 

19 

23 


Both pymorphy2 and Mystem made less than 1% errors (without disambigua¬ 
tion, i.e. in less than 1% cases the correct analysis was not in a set of analyses 
returned by an analyzer). It should be noted that 9 out of 19 pymorphy2 errors 
and 14 out of 23 Mystem errors were related to abbreviation handling. Mystem 
handled first and last names better (1 mistake versus 4 for pymorphy2); pymor- 
phy2 made less mistakes for "regular" words (4 versus 6 for mystem). Mystem 
can’t parse many hyphenated words as a single token; such words were not con¬ 
sidered. Punctuation, numbers and non-Russian words were also removed from 
the input. 

It is hard to draw a quantitative conclusion because the corpus size is small. 
Both analyzers has a similar analysis quality, and the resulting numbers depend 
on evaluation minutiae: whether abbreviations are considered or not, should we 
require hyphenated words to be parsed, do we require verb transitivity to be 
predicted correctly, is it important to distinguish adverbs from parenthesis, etc. 

https://tech.yandex.ru/mystem/ 
https://github.com/kmike/microcorpus 

http://nbviewer.ipython.org/gist/kmike/52fb0a9b3ed627310bea 









Several human annotation errors were found by parsing OpenCorpora data 
with mystem (1 error) and ruscorpora data with pymorphy2 (6 errors). Open- 
Corpora shares a dictionary with pymorphy2, and ruscorpora annotation is re¬ 
lated to mystem; this shows an utility of using cross-corpora tools to check the 
annotations. 

The most sophisticated Russian morphological parser evaluation so far is [Th 
it happened in 2010. Previous version of pymorphy2 (pymorphy) Darticipated^u 
in tracks without disambiguation; it finished 1st on Full Morphology Analysis, 
3rd on Lemmatization, 3rd on POS tagging and 5th on the Rare Words track, 
pymorphy haven’t participated in disambiguation tracks. 

pymorphy used some pymorphy2 rules (not all) and a different dictionary 
(extracted from (TUI instead of m)- Generally, pymorphy2 should work better 
than pymorphy because of an improved dictionary and rules, but this has not 
been not measured quantitatively yet. 


7 Conclusion and Future Plans 

Permissive open-source license (MIT) is used for pymorphy2. All the dictionaries 
and corpora pymorphy2 depends on are also available under permissive open- 
source licenses. This encourages usage and contributions. There are volunteers 
working on Russian and Ukrainian dictionaries and corpora, related tools and 
pymorphy2 itself. 

Development of pymorphy2 is by no means finished. There are word classes 
for which pymorphy2 analysis can be improved. Some of them: people last and 
patronymic names, foreign people names, diminutive first names, locations, up¬ 
percase and other abbreviations, some classes of hyphenated words, ordinal num¬ 
bers (including ordinal numbers written in digit notation like "22 -h"). According 
to [T], similar issues are common for Russian morphological analyzers. 

Non-contextual P(tag\word) estimates can be made better by transferring 
some information about similar words and by improving the corpora. 

A better comparison between pymorphy, pymorphy2, Mystem and other mor¬ 
phological analyzers could require a robust tagset conversion library. 

The support for Ukrainian is experimental. The dictionary requires work, 
pymorphy2 needs more Ukrainian-specific rules for handling of out of vocabulary 
words, and for better P(tag\word) estimates an annotated Ukrainian corpus is 
needed: even a small corpus (or even a manually crafted frequency list) should 
fix a substantial amount of "obvious" errors. 

There are plans to add Belarusian language support to pymorphy2 based on 
Belarusian N-korpu^3 grammar database. 

Although pymorphy2 is already fast enough for many use cases (tens of thou¬ 
sands words per second in a single thread), there is a room for further speed 
improvements. 


Anonymized resnlts: http://ru-eval.ru/tables_index.html 
http://bnkorpus.mfo 
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